Introduction

This IPython notebook illustrates how to refine the results of matching using triggers.

First, we need to import the py_entitymatching package and other libraries as follows:


In [2]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Then, read the (sample) input tables for matching purposes.


In [3]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'

In [5]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')

# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
S.head()


Out[5]:
_id ltable_id rtable_id ltable_title ltable_authors ltable_year rtable_title rtable_authors rtable_year label
0 0 l1223 r498 Dynamic Information Visualization Yannis E. Ioannidis 1996 Dynamic information visualization Yannis E. Ioannidis 1996 1
1 1 l1563 r1285 Dynamic Load Balancing in Hierarchical Parallel Database Systems Luc Bouganim, Daniela Florescu, Patrick Valduriez 1996 Dynamic Load Balancing in Hierarchical Parallel Database Systems Luc Bouganim, Daniela Florescu, Patrick Valduriez 1996 1
2 2 l1514 r1348 Query Processing and Optimization in Oracle Rdb Gennady Antoshenkov, Mohamed Ziauddin 1996 prospector: a content-based multimedia server for massively parallel architectures S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader 1996 0
3 3 l206 r1641 An Asymptotically Optimal Multiversion B-Tree Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger 1996 A complete temporal relational algebra Debabrata Dey, Terence M. Barron, Veda C. Storey 1996 0
4 4 l1589 r495 Evaluating Probabilistic Queries over Imprecise Data Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar 2003 Evaluating probabilistic queries over imprecise data Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar 2003 1
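
py_entitymatching stores metadata alongside each table (the key of S and the foreign keys pointing into A and B). As an optional sanity check, we can display the metadata to confirm it was picked up correctly:


In [ ]:
# Optional sanity check: display the metadata stored for the labeled table
em.show_properties(S)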

Use an ML Matcher to Get Predictions

Here we will purposely create a decision tree matcher that does not take several useful features into account, so that we can later show how triggers can be used to refine the model.


In [6]:
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']

In [7]:
# Create a Decision Tree Matcher
dt = em.DTMatcher(name='DecisionTree', random_state=0)

In [8]:
# Generate a set of features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
feature_table


Out[8]:
feature_name left_attribute right_attribute left_attr_tokenizer right_attr_tokenizer simfunction function function_source is_auto_generated
0 id_id_lev_dist id id None None lev_dist <function id_id_lev_dist at 0x11b874aa0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
1 id_id_lev_sim id id None None lev_sim <function id_id_lev_sim at 0x11b874d70> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
2 id_id_jar id id None None jaro <function id_id_jar at 0x11b874a28> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
3 id_id_jwn id id None None jaro_winkler <function id_id_jwn at 0x11b874c80> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
4 id_id_exm id id None None exact_match <function id_id_exm at 0x11b874de8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
5 id_id_jac_qgm_3_qgm_3 id id qgm_3 qgm_3 jaccard <function id_id_jac_qgm_3_qgm_3 at 0x11b874e60> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
6 title_title_jac_qgm_3_qgm_3 title title qgm_3 qgm_3 jaccard <function title_title_jac_qgm_3_qgm_3 at 0x11b889050> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
7 title_title_cos_dlm_dc0_dlm_dc0 title title dlm_dc0 dlm_dc0 cosine <function title_title_cos_dlm_dc0_dlm_dc0 at 0x11b8890c8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
8 title_title_mel title title None None monge_elkan <function title_title_mel at 0x11b889140> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
9 title_title_lev_dist title title None None lev_dist <function title_title_lev_dist at 0x11b8891b8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
10 title_title_lev_sim title title None None lev_sim <function title_title_lev_sim at 0x11b889230> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
11 authors_authors_jac_qgm_3_qgm_3 authors authors qgm_3 qgm_3 jaccard <function authors_authors_jac_qgm_3_qgm_3 at 0x11b8892a8> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
12 authors_authors_cos_dlm_dc0_dlm_dc0 authors authors dlm_dc0 dlm_dc0 cosine <function authors_authors_cos_dlm_dc0_dlm_dc0 at 0x11b889320> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
13 authors_authors_mel authors authors None None monge_elkan <function authors_authors_mel at 0x11b889398> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
14 authors_authors_lev_dist authors authors None None lev_dist <function authors_authors_lev_dist at 0x11b889410> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
15 authors_authors_lev_sim authors authors None None lev_sim <function authors_authors_lev_sim at 0x11b889488> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
16 year_year_exm year year None None exact_match <function year_year_exm at 0x11b889500> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
17 year_year_anm year year None None abs_norm <function year_year_anm at 0x11b889578> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
18 year_year_lev_dist year year None None lev_dist <function year_year_lev_dist at 0x11b8895f0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
19 year_year_lev_sim year year None None lev_sim <function year_year_lev_sim at 0x11b889668> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True

In [9]:
# We will remove many of the features here to purposely create a poor model. This will make it
# easier to demonstrate triggers later
F = feature_table.drop([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
F


Out[9]:
feature_name left_attribute right_attribute left_attr_tokenizer right_attr_tokenizer simfunction function function_source is_auto_generated
0 id_id_lev_dist id id None None lev_dist <function id_id_lev_dist at 0x11b874aa0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
15 authors_authors_lev_sim authors authors None None lev_sim <function authors_authors_lev_sim at 0x11b889488> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
16 year_year_exm year year None None exact_match <function year_year_exm at 0x11b889500> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
17 year_year_anm year year None None abs_norm <function year_year_anm at 0x11b889578> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
18 year_year_lev_dist year year None None lev_dist <function year_year_lev_dist at 0x11b8895f0> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
19 year_year_lev_sim year year None None lev_sim <function year_year_lev_sim at 0x11b889668> from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ... True
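
Dropping rows by positional label works here because the feature table still has its default integer index, but selecting the features to keep by name is a little more robust. A sketch of an equivalent, name-based alternative:


In [ ]:
# Equivalent, name-based way to keep only the desired features
keep = ['id_id_lev_dist', 'authors_authors_lev_sim', 'year_year_exm',
        'year_year_anm', 'year_year_lev_dist', 'year_year_lev_sim']
F = feature_table[feature_table['feature_name'].isin(keep)]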

In [10]:
# Convert I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='label',
                            show_progress=False)
H.head()


Out[10]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label
430 430 l1494 r1257 4 0.083333 1 1.0 0.0 1.0 0
35 35 l1385 r1160 4 0.271186 1 1.0 0.0 1.0 0
394 394 l1345 r85 4 0.338462 1 1.0 0.0 1.0 1
29 29 l611 r141 3 0.277778 1 1.0 0.0 1.0 0
181 181 l1164 r1161 2 0.244444 1 1.0 0.0 1.0 1
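
Feature vectors can contain missing values (for example, when an attribute value is missing in A or B), which the scikit-learn based matchers cannot handle. A quick pandas check before training:


In [ ]:
# Check whether any feature vector contains missing values
H.isnull().values.any()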

In [11]:
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                strategy='mean')

In [12]:
# Fit the decision tree to the feature vectors
dt.fit(table=H, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], target_attr='label')

In [13]:
# Use the decision tree matcher to predict if tuple pairs match
dt.predict(table=H, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], target_attr='predicted_labels', 
           return_probs=True, probs_attr='proba', append=True, inplace=True)
H.head()


Out[13]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
430 430 l1494 r1257 4.0 0.083333 1.0 1.0 0.0 1.0 0 0 0.0
35 35 l1385 r1160 4.0 0.271186 1.0 1.0 0.0 1.0 0 0 0.0
394 394 l1345 r85 4.0 0.338462 1.0 1.0 0.0 1.0 1 1 1.0
29 29 l611 r141 3.0 0.277778 1.0 1.0 0.0 1.0 0 0 0.0
181 181 l1164 r1161 2.0 0.244444 1.0 1.0 0.0 1.0 1 1 1.0
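
Before debugging, it is worth quantifying how well this deliberately weakened matcher performs. A quick sketch using py_entitymatching's evaluation utilities (note that evaluating on the table the matcher was trained on gives optimistic numbers):


In [ ]:
# Summarize precision, recall, and F1 of the predictions against the gold labels
eval_result = em.eval_matches(H, 'label', 'predicted_labels')
em.print_eval_summary(eval_result)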

Debug the ML Matcher

Now we will use the debugger to determine what problems exist with our decision tree matcher.


In [14]:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']

In [15]:
# Debug the decision tree matcher using the GUI
em.vis_debug_dt(dt, P, Q, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        target_attr='label')

# We see with the debugger that most of the false negatives are pairs whose titles are very similar
# but whose author strings differ (e.g., the same names listed in a different order). The matcher
# misses them, most likely because we removed all of the features that compare the Title attribute
# from each table earlier.

In [16]:
# We can see which tuple pairs are not predicted correctly
H[H['label'] != H['predicted_labels']]


Out[16]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
371 371 l650 r1594 4.0 0.120000 1.0 1.0 0.0 1.0 1 0 0.500000
259 259 l938 r1090 5.0 0.200000 1.0 1.0 0.0 1.0 1 0 0.333333
346 346 l1681 r693 4.0 0.238095 1.0 1.0 0.0 1.0 1 0 0.500000
184 184 l891 r485 4.0 0.137931 1.0 1.0 0.0 1.0 1 0 0.500000
11 11 l1189 r1674 4.0 0.222222 1.0 1.0 0.0 1.0 1 0 0.250000
121 121 l169 r521 4.0 0.153846 1.0 1.0 0.0 1.0 1 0 0.500000
267 267 l120 r1181 4.0 0.216667 1.0 1.0 0.0 1.0 1 0 0.500000
147 147 l867 r1263 4.0 0.142857 1.0 1.0 0.0 1.0 1 0 0.333333
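
To connect these rows back to what the debugger showed, we can pull the original titles of the mispredicted pairs out of S (a quick sketch):


In [ ]:
# Look up the original titles of the mispredicted tuple pairs in S
bad_ids = H[H['label'] != H['predicted_labels']]['_id']
S[S['_id'].isin(bad_ids)][['_id', 'ltable_title', 'rtable_title']]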

Using Triggers to Improve Results

This typically involves the following steps:

  1. Creating the Match Trigger
  2. Adding Rules
  3. Adding a Condition Status and Action
  4. Using the Trigger to Improve Results

Creating the Match Trigger


In [17]:
# Use the constructor to create a trigger
mt = em.MatchTrigger()

Adding Rules

Before we can use the trigger, we need to create rules to evaluate tuple pairs. Each rule is a list of strings; taken together, the strings in a rule specify a conjunction of predicates. Each predicate has three parts: (1) an expression, (2) a comparison operator, and (3) a value. The expression is evaluated over a tuple pair, producing a numeric value.
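
For instance, the expression in the predicate title_title_lev_sim(ltuple, rtuple) > 0.7 (used below) can be evaluated by hand: look the feature up in the feature table and apply its function to a pair of tuples. A sketch (the rows of A and B chosen here are arbitrary):


In [ ]:
# Evaluate a predicate's expression by hand on an arbitrary tuple pair
f = feature_table[feature_table['feature_name'] == 'title_title_lev_sim']
fn = f.iloc[0]['function']
fn(A.iloc[0], B.iloc[0])  # the numeric value the rule compares against 0.7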


In [18]:
# Add two rules to the trigger

# Since we removed all of the features comparing the Title attribute earlier, we now add a rule that compares titles
mt.add_cond_rule(['title_title_lev_sim(ltuple, rtuple) > 0.7'], feature_table)
# The rule has two predicates, one comparing the titles and the other looking for an exact match of the years
mt.add_cond_rule(['title_title_lev_sim(ltuple, rtuple) > 0.4', 'year_year_exm(ltuple, rtuple) == 1'], feature_table)
mt.get_rule_names()


Out[18]:
['_rule_0', '_rule_1']

In [19]:
# Rules can also be deleted from the trigger
mt.delete_rule('_rule_1')


Out[19]:
True

Adding a Condition Status and Action

Next, we need to add a condition status and an action to the trigger. The trigger applies the added rules to each tuple pair; if a rule's result matches the condition status, the action is carried out.


In [20]:
# Since we are using the trigger to fix a problem related to false negatives, we want the condition to be 
# True and the action to be 1. This way, the trigger will set a prediction to 1 when the rule returns True.

mt.add_cond_status(True)
mt.add_action(1)


Out[20]:
True

Using the Trigger to Improve Results

Now that we have added rules, a condition status, and an action, we can execute the trigger to improve the results.


In [21]:
preds = mt.execute(input_table=H, label_column='predicted_labels', inplace=False)
preds.head()


Out[21]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
430 430 l1494 r1257 4.0 0.083333 1.0 1.0 0.0 1.0 0 0 0.0
35 35 l1385 r1160 4.0 0.271186 1.0 1.0 0.0 1.0 0 0 0.0
394 394 l1345 r85 4.0 0.338462 1.0 1.0 0.0 1.0 1 1 1.0
29 29 l611 r141 3.0 0.277778 1.0 1.0 0.0 1.0 0 0 0.0
181 181 l1164 r1161 2.0 0.244444 1.0 1.0 0.0 1.0 1 1 1.0

In [22]:
# We were able to significantly reduce the number of incorrectly labeled tuple pairs
preds[preds['label'] != preds['predicted_labels']]


Out[22]:
_id ltable_id rtable_id id_id_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label predicted_labels proba
11 11 l1189 r1674 4.0 0.222222 1.0 1.0 0.0 1.0 1 0 0.25
267 267 l120 r1181 4.0 0.216667 1.0 1.0 0.0 1.0 1 0 0.50
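
We can also quantify the improvement with the same evaluation utilities used earlier (a sketch; if the metadata did not carry over to preds, copy it first with em.copy_properties(H, preds)):


In [ ]:
# Summarize precision, recall, and F1 after applying the trigger
eval_result = em.eval_matches(preds, 'label', 'predicted_labels')
em.print_eval_summary(eval_result)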

In [23]:
# We can see that the two tuple pairs that are still labeled incorrectly have the title and authors
# swapped into the wrong columns in one of the tuples.
pd.concat([S[S['_id'] == 11], S[S['_id'] == 267]])


Out[23]:
_id ltable_id rtable_id ltable_title ltable_authors ltable_year rtable_title rtable_authors rtable_year label
11 11 l1189 r1674 Weimin Du, Xiangning Liu, Abdelsalam Helal Multiview Access Protocols for Large-Scale Replication 1998 Multiview access protocols for large-scale replication Xiangning Liu, Abdelsalam Helal, Weimin Du 1998 1
267 267 l120 r1181 w. Bruce kroft, James callan, erik w. Brown fast incrremental indexiing for fulltext informtion retreval 1994 Fast Incremental Indexing For Full-Text Information Retrieval Eric W. Brown, James P. Callan, W. Bruce Croft 1994 1
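
One way to catch such swapped columns (not pursued in this notebook) would be a custom cross-attribute feature that compares the authors column on one side against the title column on the other, created with em.get_feature_fn and em.add_feature; it could then back an additional trigger rule. A hedged sketch (the feature name and threshold are illustrative):


In [ ]:
# Hypothetical extension: a cross-attribute feature to catch swapped columns
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
f = em.get_feature_fn('jaccard(qgm_3(ltuple["authors"]), qgm_3(rtuple["title"]))', sim, tok)
em.add_feature(feature_table, 'lauthors_rtitle_jac_qgm_3', f)

# An additional rule using it might look like:
# mt.add_cond_rule(['lauthors_rtitle_jac_qgm_3(ltuple, rtuple) > 0.7'], feature_table)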
